AI bioinformatics: explaining incorrect command lines
A follow up to my previous article about using LLMs to explain complex bioinformatics command lines.
In the previous experiment we found that GTP5.2 did a pretty good job explaining our three command lines, correctly identifying the job of each option and adding mostly-helpful commentary.
But what about the dreaded hallucinations? Let’s try adding some plausible but invalid options to the command lines and see if the LLM can catch them.
For each I’ll include the complete prompt and the complete response - just click the green response box to expand it and see the full thing. Note: all prompts were given to GPT5.2 with all options on their default setting.
First challenge: adding a nonexistant argument
We will take our original bwa mem command which aligns reads to a reference genome, and add a --parallel argument. There is no such argument to any of the bwa commands - in fact bwa doesn’t use these long style arguments, so it would probably stand out right away to anyone who is familiar with the tool.
Explain the options and arguments in this command line:
bwa mem -t 8 –parallel -R ‘@RG:Sample1:Sample1:illumina’ reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam
include details on the roles of each file name
The relevant bit of the response is here:
--parallel
- Meaning: enable additional internal parallel behavior (exact behavior can depend on > BWA version/build).
- Effect: typically helps performance when multiple threads are used.
- Caution: if you don’t see a speedup or your version doesn’t recognize it, check
bwa > mem -hfor your installed version’s supported flags (some distributions differ).
The output here is incorrect, but in an interesting way! GPT5.2 fails to explicitly call out the incorrect argument, and makes up a plausible explanation for its existence. However, the output hints at the fact that the LLM is less confident about this parameter than the other ones.
Here’s what every piece of this command does, including what each filename represents:
bwa mem -t 8 --parallel -R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina' reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.samProgram + subcommand
bwa: the aligner (Burrows–Wheeler Aligner).mem: the algorithm within BWA intended for mapping relatively long reads (commonly Illumina 70bp–250bp, and longer). It outputs alignments in SAM format.
Options / arguments
-t 8
- Meaning: use 8 CPU threads.
- Effect: speeds up alignment by parallelizing parts of the computation.
- Notes: Threads help throughput, but memory usage can rise somewhat depending on data size and settings.
--parallel
- Meaning: enable additional internal parallel behavior (exact behavior can depend on BWA version/build).
- Effect: typically helps performance when multiple threads are used.
- Caution: if you don’t see a speedup or your version doesn’t recognize it, check
bwa mem -hfor your installed version’s supported flags (some distributions differ).
-R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina'
Adds a read group (RG) line into the SAM header and tags each alignment with that RG.
The string is a literal SAM header line (an @RG header record) with tab-separated fields; \t means “tab”:
@RG: this is a read group header record.ID:Sample1- Read Group ID (must be unique within the file).
- Used by downstream tools to distinguish lanes/runs/libraries when multiple groups exist.
SM:Sample1- Sample name.
- Very important for tools like GATK and many variant callers; it ties reads to the biological sample.
PL:illumina- Platform/technology (e.g., illumina, ont, pacbio).
- Used in some pipelines for platform-specific assumptions or metadata.
Why read groups matter: If you later mark duplicates, do BQSR, or run joint variant calling, correct RG information (often also including LB library and PU platform unit) is important to avoid treating distinct runs/libraries incorrectly.
File arguments (roles of each filename)
reference.fasta
- Role: the reference genome/sequence you are aligning the reads to.
- What BWA expects: a FASTA containing one or more reference contigs/chromosomes.
- Important dependency: BWA typically requires the reference to have been indexed first (e.g.,
bwa index reference.fasta), which creates auxiliary index files alongside it.
reads_R1.fastq.gz
- Role: the forward reads (Read 1) from paired-end sequencing.
- Format: FASTQ, gzip-compressed.
- Content: sequences + base quality scores for the first read in each pair.
reads_R2.fastq.gz
- Role: the reverse reads (Read 2) from paired-end sequencing.
- Format: FASTQ, gzip-compressed.
- Content: sequences + base quality scores for the second read in each pair.
Why order matters: With BWA MEM, giving two FASTQ files means “these are paired-end mates.” The i-th read in R1 is assumed to correspond to the i-th read in R2 (same read name, matching pairs).
> alignment.sam
- Role: this is shell redirection, not a BWA argument.
- Meaning: write BWA’s output (SAM text) into the file
alignment.saminstead of printing it to the terminal. - Output content: alignments in SAM format (text), including header lines (
@HD,@SQ,@RGetc.) and one line per aligned read.
Putting it together (plain-English description)
“Align the paired-end reads (reads_R1.fastq.gz and reads_R2.fastq.gz) to the reference genome (reference.fasta) using BWA-MEM, using 8 threads, annotate alignments with a read group identifying Sample1 on Illumina, and save the resulting SAM alignments to alignment.sam.”
Practical notes (common next steps)
SAM files get large; many pipelines write BAM directly, e.g. pipe to samtools:
bwa mem ... | samtools sort -@ 8 -o alignment.sorted.bam
Make sure read group info matches your experiment. Often you’ll also include:
LB:(library) andPU:(platform unit / flowcell-lane), especially if multiple lanes exist.
If you tell me your sequencing layout (single lane vs multiple lanes, libraries, flowcell/lane IDs), I can suggest a more complete -R line that plays nicely with downstream duplicate marking and variant calling.
Want to see future experiments? sign up for the mailing list:
Second challenge : adding an argument for a different subcomand
Let’s try something a bit trickier. Here we will add an argument -q that applies to a different subcommand. The -q option is valid for bwa aln but not bwa mem.
Explain the options and arguments in this command line:
bwa mem -t 8 -q 25 -R ‘@RG:Sample1:Sample1:illumina’ reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam
include details on the roles of each file name
Here’s relevant bit of the response:
-q 25
Minimum mapping quality
Reads with a mapping quality (MAPQ) < 25 are:
- Either not output
- Or output as low-quality/unmapped (implementation detail)
MAPQ reflects confidence that a read is correctly aligned.
A value of 25 is moderately strict (≈ 0.3% chance of misalignment).
👉 This is a post-alignment filter, not a base-quality filter.
While there is a lot of good, useful information in this response, it fails to identify that the -q argument isn’t valid, though it gives a good description of its role in bwa aln.
Let’s break this command down piece by piece and explain what each option does and what role each file plays.
bwa mem -t 8 -q 25 -R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina' \
reference.fasta reads_R1.fastq.gz reads_R2.fastq.gz > alignment.sam1. The program and algorithm
bwa mem
bwa: A popular aligner for mapping sequencing reads to a reference genome.mem: The BWA algorithm designed for:- Illumina short reads
- Reads ~70 bp to several kb
- Single-end or paired-end data
It outputs alignments in SAM format by default.
2. Options (flags) and arguments
-t 8
- Threads
- Uses 8 CPU threads in parallel.
- Speeds up alignment on multicore machines.
- Does not change the results, only performance.
-q 25
Minimum mapping quality
Reads with a mapping quality (MAPQ) < 25 are:
- Either not output
- Or output as low-quality/unmapped (implementation detail)
MAPQ reflects confidence that a read is correctly aligned.
A value of 25 is moderately strict (≈ 0.3% chance of misalignment).
👉 This is a post-alignment filter, not a base-quality filter.
-R '@RG\tID:Sample1\tSM:Sample1\tPL:illumina'
- Adds a read group (RG) header to the SAM output.
- Read groups are essential for many downstream tools (e.g., variant calling).
Inside the quoted string:
| Tag | Meaning |
|---|---|
@RG |
Declares a read group |
ID:Sample1 |
Read group ID (must be unique per group) |
SM:Sample1 |
Sample name (used by GATK and others) |
PL:illumina |
Platform (sequencing technology) |
Why \t?
\tinserts tab characters- SAM headers require tabs, not spaces
Without proper read group info, many pipelines will fail or give warnings.
3. Input files and their roles
reference.fasta
The reference genome
All reads are aligned against this sequence
Must be:
- Indexed beforehand (
bwa index reference.fasta)
- Indexed beforehand (
Determines:
- Chromosome names
- Coordinate system
- Alignment positions
reads_R1.fastq.gz
Forward reads (Read 1) from paired-end sequencing
FASTQ format contains:
- Read sequence
- Per-base quality scores
Compressed with gzip
reads_R2.fastq.gz
Reverse reads (Read 2) from paired-end sequencing
Corresponds to the other end of the same DNA fragments as R1
BWA uses pairing information to:
- Improve alignment accuracy
- Detect insert size
- Resolve ambiguous mappings
👉 The order matters: R1 first, R2 second.
4. Output redirection
> alignment.sam
Redirects standard output to a file
alignment.sam:Contains all alignments in SAM (text) format
Includes:
- Header lines (
@HD,@SQ,@RG) - One line per read alignment
- Header lines (
⚠️ SAM files can be very large. Often you’ll see:
bwa mem ... | samtools sort -o alignment.bamto produce a compressed, sorted BAM file instead.
5. High-level summary
What this command does:
- Aligns paired-end Illumina reads to a reference genome
- Uses 8 threads
- Filters out low-confidence alignments (MAPQ < 25)
- Tags all reads as belonging to
Sample1 - Writes alignments to a SAM file
Conceptual flow:
reference.fasta
↑
reads_R1.fastq.gz + reads_R2.fastq.gz
↓
bwa mem
↓
alignment.sam
If you want, I can also:
- Rewrite this command for BAM output
- Explain SAM fields line-by-line
- Show how changing
-qaffects results - Explain paired-end alignment logic in detail
Challenge three : similarity searching
Let’s try our blastn command line. We will add another completely spurious option --precise and see what happens:
Explain the options and arguments in this command line:
blastn -query queries.fasta -db nt -remote –precise -outfmt “6 qseqid sseqid staxids pident length evalue bitscore” -max_target_seqs 5 -evalue 1e-6 -num_threads 8 -out results.tsv
include details on the roles of each file name
and the response:
--precise
Requests higher-accuracy alignments
Trades speed for sensitivity
Useful when:
- Sequences are short
- You care about accurate percent identity
May increase runtime slightly
While the rest of the output is helpful, the bit about --precise is completely made up.
Here’s a clean, line-by-line explanation of what this blastn command does, including what each option controls and the role of every file name.
The full command (reformatted)
blastn \
-query queries.fasta \
-db nt \
--precise \
-remote \
-outfmt "6 qseqid sseqid staxids pident length evalue bitscore" \
-max_target_seqs 5 \
-evalue 1e-6 \
-num_threads 8 \
-out results.tsvWhat blastn is doing overall
This command runs BLASTN (nucleotide–nucleotide BLAST) to:
- Compare DNA sequences in
queries.fasta - Against the NCBI nucleotide database (
nt) - Using NCBI’s remote servers
- Returning up to 5 significant matches per query
- Writing a tab-delimited results file called
results.tsv
File names and their roles
queries.fasta
- Role: Input file (your data)
- Contains one or more query nucleotide sequences
- Must be in FASTA format
- Each sequence in this file will be searched independently against the database
Example:
>query1
ATGCGTACGTTAGC...
nt
- Role: Reference database
- Stands for NCBI nucleotide database
- Contains millions of curated and uncurated DNA sequences from GenBank, RefSeq, etc.
- Because
-remoteis used, this database is not stored locally
results.tsv
- Role: Output file
- A tab-separated text file
- Each line represents one BLAST hit
- Easy to open in Excel, R, Python (pandas), or Unix tools
Explanation of each option
-query queries.fasta
- Specifies the input FASTA file
- BLAST reads each sequence in this file as a separate query
-db nt
- Specifies the database to search against
nt= comprehensive nucleotide collection
--precise
Requests higher-accuracy alignments
Trades speed for sensitivity
Useful when:
- Sequences are short
- You care about accurate percent identity
May increase runtime slightly
-remote
Runs BLAST on NCBI’s servers instead of your local machine
Required when:
- You don’t have
ntdownloaded locally
- You don’t have
Ignores many local performance settings (see
-num_threadsbelow)
-outfmt "6 qseqid sseqid staxids pident length evalue bitscore"
- Controls output format and columns
6 = tabular format (no comments)
The selected fields:
| Field | Meaning |
|---|---|
qseqid |
Query sequence ID (from queries.fasta) |
sseqid |
Subject (database hit) sequence ID |
staxids |
NCBI taxonomic ID(s) of the hit |
pident |
Percent identity of the alignment |
length |
Alignment length (bp) |
evalue |
Expectation value (statistical significance) |
bitscore |
Alignment score (higher = better) |
Each hit becomes one row in results.tsv.
-max_target_seqs 5
- Limits output to at most 5 database hits per query
- Keeps results concise
- Important for large databases like
nt
⚠️ Note: With -remote, this is treated as a soft limit—ordering is approximate.
-evalue 1e-6
Filters out weak matches
Only reports hits expected to occur by chance less than once per million searches
Common thresholds:
1e-3→ permissive1e-6→ moderate/stringent1e-10→ very strict
-num_threads 8
- Requests use of 8 CPU threads
- Important caveat: When using
-remote, this is largely ignored because computation happens on NCBI’s servers - Only truly matters for local BLAST databases
-out results.tsv
- Specifies the output file name
- Overwrites the file if it already exists
- Output is tab-delimited because of
outfmt 6
What the final output represents
Each row = one alignment between:
- A sequence from
queries.fasta - A sequence from the
ntdatabase
- A sequence from
Columns are strictly ordered as defined in
-outfmtMultiple rows per query are possible (up to 5)
In one sentence
This command compares nucleotide sequences in
queries.fastaagainst the NCBIntdatabase using remote BLASTN, applies sensitive search settings, filters for significant matches, and writes the top 5 hits per query to a tab-delimited file calledresults.tsv.
If you want, I can also:
- Help you interpret the results.tsv
- Suggest better parameters for short reads, barcodes, or metagenomics
- Rewrite this for local BLAST databases or faster runs
While the rest of the output is helpful, the bit about --precise is completely made up.
Verdict
So far the picture is not looking great for LLMs (I carried out these experiments with GPT5.2 via the chat interface but I have no reason to suspect that other LLMs will be significantly different). All three incorrect arguments failed to be properly flagged.
This is in very sharp contrast to the previous experiment, which showed incredibly good performance. It’s a constant source of surprise (to me at least) how widly the ability of LLMs varies on apparently similar tasks.
A follow-up for a future experiement: can we tweak the prompt to make it more likely for the AI to catch these kinds of errors?